NSF PAR Search | NSF Public Access Repository

IPComp: Interpolation Based Progressive Lossy Compression for Scientific Applications

Yang, Z; Di, S; Zhang, L; Li, R; Li, X; Huang, J; Liu, J; Cappello, F; Zhao, K (July 2025, The 34th ACM International Symposium on High-Performance Parallel and Distributed Computing)

Free, publicly-accessible full text available July 23, 2026
Lossy Compression of Scientific Data: Applications Constrains and Requirements

https://doi.org/10.48550/arXiv.2503.20031

Cappello, F; Baker, A; Bozdağ, E; Burtscher, M; Chard, K; Di, S; Grady, P; Jiang, P; Li, S; Lindahl, E; et al (March 2025, arXiv)

Free, publicly-accessible full text available March 25, 2026
DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

https://doi.org/10.1109/CCGrid49817.2020.00-76

Nicolae, B.; Li, J.; Wozniak, J. M.; Bosilca, G.; Dorier, M.; Cappello, F. (May 2020, 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID))

In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging in such applications is frequent checkpointing of the state of the learning model during training, needed in a variety of scenarios: analysis of intermediate states to explain features and correlations with training data, exploration strategies involving alternative models that share a common ancestor, knowledge transfer, resilience, etc. However, with the increasing size of learning models and the popularity of distributed data-parallel training approaches, the simple checkpointing techniques used so far face several limitations: low serialization performance, blocking I/O, and stragglers caused by the fact that only a single process is involved in checkpointing. This paper proposes a checkpointing technique specifically designed to address these limitations, introducing efficient asynchronous techniques to hide the overhead of serialization and I/O and to distribute the load over all participating processes. Experiments with two deep learning applications (CANDLE and ResNet) on a pre-Exascale HPC platform (Theta) show significant improvement over the state of the art, both in terms of checkpointing duration and runtime overhead.
Full Text Available
Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring

https://doi.org/10.1007/978-3-030-29400-7_4

Tseng, S-M.; Nicolae, B.; Bosilca, G.; Jeannot, E.; Chandramowlishwaran, A.; Cappello, F. (August 2019, 2019 European Conference on Parallel Processing (Euro-Par 2019))
Full Text Available
